
Analysing NLP publication patterns

Recently, I got curious about how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends to be seen in recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, of how big a research group is, and of how outward-facing its research projects are.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all of these conferences have nice webpages listing the papers published there. The ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML 2012, which is on the conference website). I wrote Python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the PDFs into text and extract anything that looked like a university or company name in the first 30 lines of the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.
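
To give a flavour of that last step, here is a minimal sketch of the affiliation matching. It assumes the pdftotext command-line tool is available, and the pattern list and function name are illustrative examples rather than the exact rules behind the numbers below.

```python
import re
import subprocess

# A few illustrative patterns mapping raw affiliation strings to canonical
# names; the real list contains many more rules and edge cases.
CANONICAL_PATTERNS = [
    (re.compile(r"\bUniversity College London\b|\bUCL\b"), "University College London"),
    (re.compile(r"\bGoogle(\s+Inc\.?)?\b"), "Google"),
    (re.compile(r"\bCarnegie Mellon\b|\bCMU\b"), "Carnegie Mellon University"),
    (re.compile(r"\bMicrosoft( Research)?\b"), "Microsoft"),
]

def extract_organisations(pdf_path, max_lines=30):
    """Convert a PDF to plain text and match organisation names
    in the first `max_lines` lines (roughly the title/author block)."""
    text = subprocess.check_output(["pdftotext", pdf_path, "-"])
    header = "\n".join(text.decode("utf-8", errors="ignore").splitlines()[:max_lines])
    found = set()
    for pattern, canonical in CANONICAL_PATTERNS:
        if pattern.search(header):
            found.add(canonical)
    return sorted(found)
```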

Below is the graph of top 25 organisations and the conferences where they publish.
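
For anyone wanting to reproduce this kind of figure from their own crawl, here is a minimal sketch of a stacked per-conference bar chart. It assumes a hypothetical `papers` list of records (conference, year, and the set of organisations extracted from each paper); this is an illustration, not the exact code behind the graph.

```python
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import numpy as np

# `papers` is assumed to be the list of crawled records, one per paper, e.g.
# papers = [{"conference": "ACL", "year": 2014, "orgs": {"Google", "Carnegie Mellon University"}}, ...]

def plot_by_conference(papers, top_n=25):
    # Rank organisations by total paper count and keep the top N.
    totals = Counter(org for p in papers for org in p["orgs"])
    top = [org for org, _ in totals.most_common(top_n)]
    conferences = sorted({p["conference"] for p in papers})

    # Count papers per (organisation, conference) pair.
    counts = defaultdict(Counter)
    for p in papers:
        for org in p["orgs"]:
            counts[org][p["conference"]] += 1

    # Stack one bar segment per conference for each organisation.
    bottom = np.zeros(len(top))
    for conf in conferences:
        values = np.array([counts[org][conf] for org in top])
        plt.bar(range(len(top)), values, bottom=bottom, label=conf)
        bottom += values

    plt.xticks(range(len(top)), top, rotation=90)
    plt.ylabel("Number of papers")
    plt.legend()
    plt.tight_layout()
    plt.show()
```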

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, which also leads in the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers as Google, especially as Google seems to get much more publicity for its research. Stanford rounds out the top 3, all of which publish substantially more than the rest. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution of conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and the University of Maryland have most of their publications in NLP-related conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with a roughly 50:50 split between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.
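
A minimal sketch of such a per-year plot, reusing the same hypothetical `papers` record format as above, might look like this:

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_per_year(papers, top_n=15):
    # Find the most prolific organisations overall.
    totals = Counter(org for p in papers for org in p["orgs"])
    top = [org for org, _ in totals.most_common(top_n)]
    years = sorted({p["year"] for p in papers})

    # One line per organisation, counting its papers in each year.
    for org in top:
        per_year = Counter(p["year"] for p in papers if org in p["orgs"])
        plt.plot(years, [per_year.get(y, 0) for y in years], marker="o", label=org)

    plt.xlabel("Year")
    plt.ylabel("Number of papers")
    plt.legend(fontsize="small")
    plt.show()
```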

Carnegie Mellon has a very good track record, but has only recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in recent years. The sudden drop for 2016 is due to incomplete data – at the time of writing, the ACL, EMNLP and NIPS papers for this year are not available yet.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This is even more impressive given that he started with just 2 papers in 2012 and then rocketed to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each in NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th, with more stable publishing records, but also focus mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin focus mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising in publishing either in NLP or in ML venues. This is somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future, to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added the statistics for first authors with highest publication counts. Jiwei Li from Stanford towers above others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing the top N authors on the graphs, some authors with equal numbers of publications were being excluded. I’ve adjusted the value of N for each graph so this doesn’t happen (one way of doing this is sketched after these updates).

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.
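
As a side note on Update 4, below is a minimal sketch of one way to extend a top-N cutoff so that authors tied with the N-th entry are kept. The helper name is hypothetical; this illustrates the idea rather than the exact code behind the graphs.

```python
from collections import Counter

def top_with_ties(counts, n):
    """Return at least `n` entries, extending the cutoff so that anyone
    tied with the n-th entry is also included."""
    ranked = counts.most_common()
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(name, c) for name, c in ranked if c >= cutoff]

# Example: with n=2, both authors on 3 papers are kept.
author_counts = Counter({"A": 5, "B": 3, "C": 3, "D": 1})
print(top_with_ties(author_counts, 2))  # [('A', 5), ('B', 3), ('C', 3)]
```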

Comments

    • Marek

      The full data on organisations is quite noisy at the lower ranks at the moment, as it is extracted from PDFs and then post-processed with manual rules. It still contains a long tail of alternative spellings and entries that are not institutions at all (e.g. College Park).
      Imperial College London comes up with 7 entries in there, although it’s worth noting that I’m only looking at 6 specific conferences, and Imperial seems to be publishing in somewhat different areas.

    • Marek

      Thanks! Indeed, I’m not catching alternative names for authors at the moment. I will update it soon and add a fix for your name.

  1. Jason Eisner

    How about including TACL? It’s a journal, but deliberately set up to be another mechanism for publishing normal ACL-style papers, so leaving it out of the analysis is strange. The format is essentially the same as ACL/NAACL/EMNLP/EACL, and you get to present the work at one of those conferences. Downloading and scraping the papers should be no different than for ACL. Whether you submit via TACL or directly via the conferences is as much a matter of when the deadlines fall as anything else. (Although TACL papers arguably should count a bit more: they generally get more thorough reviews, are often required to make revisions for final acceptance, and tend to be longer.)

    There’s also a question of whether long-form journal papers (JMLR, CL, etc.) should be included in measures of productivity. Perhaps those are often just synthesizing and expanding previously published conference papers? – but I’m not sure.

    Of course, I hope that no one optimizes for your ranking.

    • Marek

      I chose the 6 conferences simply based on which sources I personally follow the most. I completely agree that there are many other conferences and journals that could be included: TACL, COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc.
      I intend to post an update at the end of the year and will include a longer list of conferences. Feel free to suggest additional sources which I haven’t listed yet.

      • Wei Xu

        I second Jason. TACL is essentially equal to ACL/NAACL/EMNLP/EACL; it is quite different from COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc., and much closer to the centre of NLP research. I would recommend that anyone interested in NLP follow TACL papers just as closely (if not more closely) as ACL/NAACL/EMNLP/EACL.

  2. Ryan

    Thanks for the nice post, but some of the numbers seem off, and the errors may be related to parsing Chinese names. For example, Yuchen Zhang does not have an EMNLP, and there are at least two Yuxin Chen working in this area but neither of them has 7 ICML+NIPS alone. Perhaps you double-counted other people named Y. Zhang or Y. Chen?

  3. Jochen L Leidner

    Nice infographic, thanks! Immediate feature requests: How about patents? Including IR? Speech? Top single authors? Or which university fosters the most team co-authoring? Citation impact per institution?

  4. EXG

    Nice infographics! Quick comment: I believe INRIA is missing. Just by counting NIPS 2012-2015, I get more than 60 papers.

    • Marek

      Good point, thanks for letting me know. I’ve added a fix for mapping together different ways of naming INRIA. They are now featured in the top 25.

  5. John

    I wonder why you have included ICML and NIPS in your analysis. There is some spillover from ML into NLP and vice versa, but generally within the NLP community, only the big four (ACL, NAACL, EMNLP, and EACL) matter. The other two are really machine learning conferences and are not of that much interest to researchers in Computational Linguistics/NLP, so the data from NIPS and ICML are more like noise and don’t give you much information on current trends in the field.

    • Marek

      I chose the conferences that influence my work the most. Totally subjective, I agree. On the spectrum of linguistics-NLP-ML, I am more on the ML side.
